
Text Similarity Detection in Web Scraping Data Cleaning (Part 1)

Learn practical text similarity for web scraping data cleaning. Detect near-duplicate news articles using Levenshtein distance and TF-IDF cosine similarity, with steps, pros and cons, and Python scikit-learn code. Improve deduplication, relevance, and crawl efficiency by measuring similarity between documents to filter repeats and consolidate content across sites.

2025-11-11

When scraping news data, it is common for the same article, or different articles on the same topic, to be published across multiple websites and media outlets.

We therefore need a way to determine whether two articles are similar. If the similarity is high enough, the articles can be grouped under the same topic, or one of the near-duplicates can simply be discarded. This article introduces common data cleaning techniques, along with the algorithms and worked examples for calculating the similarity between two articles.

Levenshtein Distance

Levenshtein distance (also known as edit distance) is a measure of the degree of difference between two strings. It represents the minimum number of single-character operations required to convert one string into another. These operations include three types: inserting a character, deleting a character, and substituting one character for another.

Calculation Example

Take the strings “kitten” and “sitting” as an example:

  1. “kitten” → “sitten” (replace “k” with “s”)
  2. “sitten” → “sittin” (replace “e” with “i”)
  3. “sittin” → “sitting” (insert “g”)

A total of 3 operations are needed, so their Levenshtein distance is 3.
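
For reference, here is a minimal pure-Python dynamic-programming sketch of this calculation (for illustration only; the library introduced below is much faster on long texts):

def levenshtein(s1: str, s2: str) -> int:
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all characters of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1  # substitution cost
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost  # substitution (or match)
            )
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3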

Python has a ready-made library for this called python-Levenshtein.

Installation:

pip install python-Levenshtein

We then use the following code to determine the similarity between two sentences:

from Levenshtein import distance  # Need to install python-Levenshtein

def text_similarity_simple(text1, text2):
    # Calculate edit distance
    edit_dist = distance(text1, text2)
    # Normalize similarity (ranges from 0 to 1, where 1 means identical)
    max_len = max(len(text1), len(text2))
    return 1 - edit_dist / max_len if max_len > 0 else 1.0

# Examples

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick brown fox jumps over the sleepy dog"

print(f"Similarity between A and B: {text_similarity_simple(text_a, text_b):.2f}")  # Approximately  0.91

The code reports an edit-distance similarity of about 0.91 for the two sentences, so they are judged to be very similar.

Levenshtein edit distance, as a classic algorithm for measuring string similarity, is widely used in scenarios such as text error correction and simple matching. However, its design gives it obvious flaws, especially when dealing with complex semantics, long texts, or domain-specific requirements. Its core defect is analyzed below, with an example that illustrates the limits of its applicable scenarios:

It ignores the “semantic relevance” of characters/words and only focuses on surface forms.

The essence of Levenshtein distance is counting “mechanical editing operations” (insertion, deletion, replacement) based on characters. It completely ignores the semantic logic behind characters or words. Even if two words/sentences have exactly the same meaning, if their surface characters differ greatly, the calculated distance will be high (and the similarity will be low); conversely, content with similar characters but irrelevant semantics may be mistakenly judged as “similar”.

For example:

Word A: “apple” (a kind of fruit)
Word B: “apply” (a verb meaning to request)

The two words differ only in the last character (e vs. y), so the Levenshtein distance is just 1 (one substitution). The algorithm judges them as "highly similar", yet their meanings are completely unrelated. The reverse also happens: words with the same meaning but very different spellings are judged as dissimilar.
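
A quick check with the distance function makes this concrete (the "big"/"large" pair is an extra example chosen here to show the reverse case):

from Levenshtein import distance

# Similar surface form, unrelated meanings: only one substitution apart
print(distance("apple", "apply"))   # 1 -> similarity 1 - 1/5 = 0.80
# Same meaning, very different surface form: judged far apart
print(distance("big", "large"))     # 4 -> similarity 1 - 4/5 = 0.20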

TF-IDF + Cosine Similarity: Word Frequency-Level Matching

This is another algorithm for judging the similarity between two sentences or articles.

  1. TF-IDF: Quantifies the "importance" of words in a text. TF (Term Frequency) measures how often a word appears in the text, and IDF (Inverse Document Frequency) measures how rare the word is across the entire document collection (the rarer it is, the higher the IDF and the stronger the importance).
  2. Cosine Similarity: Converts the TF-IDF results of two texts into “vectors” and judges the similarity by calculating the cosine value of the angle between the vectors (ranging from 0 to 1). The smaller the angle, the closer the cosine value is to 1, and the higher the similarity.

The core function of TF-IDF is to quantify the “importance weight” of each word in the text and use this as a basis to represent the text as a vector corresponding to “word-weight” (i.e., “word vector”).

It is obtained by multiplying two core indicators: TF (Term Frequency) and IDF (Inverse Document Frequency).

1. Step 1: TF (Term Frequency)

TF measures how frequently a word appears in a single document. Theoretically, the more times a word appears in a document, the greater its “semantic contribution” to the document may be (for example, “football” appears frequently in articles discussing sports and is a core word).

Calculation Formula

The most basic way to calculate TF is:
TF(t, d) = Number of times word t appears in document d / Total number of words in document d

Example

Suppose document d = "The cat chases the mouse". After lowercasing and word segmentation it becomes ["the", "cat", "chases", "the", "mouse"], 5 words in total:

TF("the", d) = 2 / 5 = 0.4
TF("cat", d) = TF("chases", d) = TF("mouse", d) = 1 / 5 = 0.2
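
A minimal Python sketch of this TF calculation (naive whitespace tokenization and lowercasing, for illustration only):

from collections import Counter

def term_frequency(document: str) -> dict:
    # Naive tokenization: lowercase and split on whitespace
    tokens = document.lower().split()
    counts = Counter(tokens)
    # TF(t, d) = number of times t appears in d / total number of words in d
    return {word: count / len(tokens) for word, count in counts.items()}

print(term_frequency("The cat chases the mouse"))
# {'the': 0.4, 'cat': 0.2, 'chases': 0.2, 'mouse': 0.2}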

2. Step 2: IDF (Inverse Document Frequency)

TF has a defect: high-frequency words are not necessarily "important words". For example, English words like "the", "a", and "is" (stop words) appear frequently in almost all documents but contribute almost nothing to the meaning of the text.

The role of IDF is to penalize words that appear frequently in the “entire document collection (Corpus)” and highlight “rare words” that appear in a few documents (rare words are often core features of a certain type of document; for example, “blockchain” only appears in technology documents and has a high IDF value).

Calculation Formula

IDF(t) = log( Total number of documents in the collection / (Number of documents containing word t + 1) )

Example

Assume the document collection (Corpus) has 1000 documents:

The word "the" appears in 999 of them: IDF("the") = log(1000 / (999 + 1)) = log(1) = 0, so it gets no weight.
The word "blockchain" appears in only 9 of them: IDF("blockchain") = log(1000 / (9 + 1)) = log(100) = 2 (using log base 10), so it gets a high weight.
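
The same idea in a few lines of Python (using log base 10 as in the example above; scikit-learn's TfidfVectorizer uses the natural log with extra smoothing, so its values differ slightly):

import math

def inverse_document_frequency(term: str, corpus: list[str]) -> float:
    # Number of documents in the collection that contain the term
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    # IDF(t) = log10( total documents / (documents containing t + 1) ); the +1 avoids division by zero
    return math.log10(len(corpus) / (containing + 1))

corpus = [
    "The cat chases the mouse",
    "The dog chases the cat",
    "A blockchain secures the ledger",
    "Cats and dogs are pets",
]
print(inverse_document_frequency("the", corpus))         # 0.0 (appears in 3 of 4 documents)
print(inverse_document_frequency("blockchain", corpus))  # about 0.301 (appears in only 1 document)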

3. Finally: TF-IDF Weight

Multiply TF by IDF to get the “final importance weight” of each word in a specific document:
TF-IDF(t, d) = TF(t, d) × IDF(t)
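
As a rough end-to-end sketch, reusing the same log base 10 convention and the tiny made-up corpus from above, the weight of a word in one document can be computed like this:

import math

def tf_idf(term: str, document: str, corpus: list[str]) -> float:
    tokens = document.lower().split()
    tf = tokens.count(term) / len(tokens)                      # TF(t, d)
    containing = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log10(len(corpus) / (containing + 1))           # IDF(t)
    return tf * idf                                            # TF-IDF(t, d)

corpus = [
    "The cat chases the mouse",
    "The dog chases the cat",
    "A blockchain secures the ledger",
    "Cats and dogs are pets",
]
print(round(tf_idf("the", corpus[0], corpus), 3))    # 0.0  (very common word gets no weight)
print(round(tf_idf("mouse", corpus[0], corpus), 3))  # 0.06 (rarer word gets a positive weight)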

Cosine Similarity

Calculation Formula

For high-dimensional vectors (text vectors are usually high-dimensional), the cosine similarity formula is:
cos(θ) = (A · B) / (||A|| × ||B||)

Example: Calculating the Similarity of Two English Sentences

Suppose we have two sentences, obtain their vectors based on TF-IDF, and then calculate the similarity using cosine similarity:

Step 1: Define Sentences and Vocabulary

Sentence A: "I love reading books"
Sentence B: "I enjoy reading novels"

Assume the shared vocabulary is ["I", "love", "reading", "books", "enjoy", "novels"] and that the TF-IDF vectors of the two sentences are:

A = (0.2, 0.8, 0.7, 0.6, 0, 0)
B = (0.2, 0, 0.7, 0, 0.8, 0.6)

Step 2: Calculate the Dot Product (A・B)

A · B = (0.2×0.2) + (0.8×0) + (0.7×0.7) + (0.6×0) + (0×0.8) + (0×0.6) = 0.04 + 0 + 0.49 + 0 + 0 + 0 = 0.53

Step 3: Calculate the Magnitudes (||A|| and ||B||)

||A|| = √(0.2² + 0.8² + 0.7² + 0.6²) = √1.53 ≈ 1.237
||B|| = √(0.2² + 0.7² + 0.8² + 0.6²) = √1.53 ≈ 1.237

Step 4: Calculate Cosine Similarity

cos(θ) = 0.53 / (1.237 × 1.237) ≈ 0.53 / 1.53 ≈ 0.346

Result interpretation: The similarity is approximately 0.35, indicating that the two sentences are somewhat relevant (both involve “reading”) but not completely similar (one is “loving reading books”, and the other is “enjoying reading novels”).
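
This manual calculation can be reproduced in a few lines of NumPy (the vectors below are the assumed TF-IDF values from Step 1):

import numpy as np

# Assumed TF-IDF vectors from Step 1 (vocabulary: I, love, reading, books, enjoy, novels)
A = np.array([0.2, 0.8, 0.7, 0.6, 0.0, 0.0])  # "I love reading books"
B = np.array([0.2, 0.0, 0.7, 0.0, 0.8, 0.6])  # "I enjoy reading novels"

# cos(theta) = (A · B) / (||A|| × ||B||)
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cos_sim, 3))  # approximately 0.346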

The following uses Python code for testing:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(sentences):
    # Initialize TF-IDF Vectorizer
    vectorizer = TfidfVectorizer(stop_words='english')  # Remove common English stop words

    # Transform sentences into TF-IDF vectors
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Get feature names (words) for reference
    feature_names = vectorizer.get_feature_names_out()

    # Calculate cosine similarity between the first sentence and others
    similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()

    return similarities, feature_names, tfidf_matrix

# Example sentences
sentences = [
    "I love reading books",
    "I enjoy reading novels",
]

# Calculate similarities
similarities, features, tfidf_matrix = calculate_similarity(sentences)

# Display results
print(f"Base sentence: {sentences[0]}\n")
for i, similarity in enumerate(similarities, 1):
    print(f"Sentence {i}: {sentences[i]}")
    print(f"Similarity score: {similarity:.4f}\n")

# Show some important words (high TF-IDF in base sentence)
print("Key words from base sentence (with significant TF-IDF weights):")
base_vector = tfidf_matrix[0].toarray()[0]
top_indices = base_vector.argsort()[-5:][::-1]  # Top 5 words
for idx in top_indices:
    print(f"- {features[idx]}: {base_vector[idx]:.4f}")

Running the above code, the output result is:

Base sentence: I love reading books

Sentence 1: I enjoy reading novels
Similarity score: 0.2020

Key words from base sentence (with significant TF-IDF weights):
- love: 0.6317
- books: 0.6317
- reading: 0.4494
- novels: 0.0000
- enjoy: 0.0000

In fact, the two sentences above are semantically quite similar, but because there is no preprocessing for synonyms, "love" and "enjoy" are treated as two unrelated words. Likewise, "books" and "novels" are judged to have a similarity of 0.

As a result, the similarity score of the two sentences is only 0.2020, so they are judged as not very similar.

TF-IDF is based only on the statistical characteristics of words and completely ignores the semantics, grammar, and contextual relationships of the text. As a result, it cannot capture "semantic relevance"; for example, "love" and "enjoy" above are not recognized as similar. It is also completely insensitive to word order and grammar. For example, "The dog bites the man" and "The man bites the dog" contain exactly the same words, so their TF-IDF vectors are identical and their cosine similarity is 1.0, even though their meanings are opposite, as the sketch below shows.
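
A minimal sketch of this word-order blindness with scikit-learn (the two sentences are a hypothetical example, not taken from the data above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same words, opposite meanings
sentences = ["The dog bites the man", "The man bites the dog"]

tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
print(f"{score:.4f}")  # 1.0000, because the bag of words is identical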

It is also distorted by low-frequency or rare words and relies on a large-scale, high-quality corpus. For example, in a "sports news" corpus, Sentence 1 contains "basketball" (a common high-frequency word with low IDF) and Sentence 2 contains "kyrie" (a basketball player's name, a rare low-frequency word with high IDF). If the two sentences differ only in this word, TF-IDF will mistakenly consider them to have extremely low similarity because of the high weight of "kyrie", ignoring the large amount of other overlapping semantic information.

However, because Levenshtein distance and TF-IDF + cosine similarity are relatively simple to compute and fast to run, they are still widely used in large-scale data cleaning.

More advanced text similarity detection technologies will be explained in more depth in the next tutorial.